A basic data analysis of one variable.

by Dr. Leonid Sakharov

A basic statistical data analysis of one variable includes calculating and explaining the mean, standard deviation, median and absolute deviation.

The fundamental case in statistics is the analysis of a series of numerical measurements of one characteristic of objects of the same nature. As a special case it can be one object and a series of measurements of one of its characteristics, for example weight or color. Nowadays it is not easy to find an example relevant to everyday life in which someone measures one characteristic of something and analyzes it statistically. The reason is that almost all of the enormous number of measurements made every moment are automated, or are made with instruments so precise that even a single measurement can be relied on.

In most practical applications of statistics, such as research on customer behavior in marketing, the situation where only one parameter has to be analyzed by itself is considered trivial, as indeed it is. That makes any illustration of basic statistics a sort of high school science project. So be it. Here is one: an analysis of the voltage of the batteries in my house.

I collected all the batteries in my apartment (regardless of type or size, rechargeable or not, used up, in use in some device, or in stock waiting as future replacements). I measured their voltage with a multimeter and put the data into an array of 55 measurements:

Voltage = array(0.096,0.731,0.442,1.328,1.327,0.67,0.58,1.354,1.358,0.093,1.519,1.586,1.598,1.599,1.544,1.596, 1.595,1.518,1.584,1.58,1.584,1.584,1.581,1.575,1.581,9.05,6.06,7.66,9.01,0.16,0.04,1.574,1.576,1.55,1.549, 1.403,1.406,1.564,1.338,1.525,1.048,1.143,1.439,1.442,1.483,1.483,3.73,3.74,4.16,1.53,1.53,1.53,1.53,1.563,1.563);

(The array form above is used only to save space. In practice the best way is to put the collected data into a column of a spreadsheet application such as MS Excel.)

Minimum and maximum.

The first step of data analysis (here and further on we omit the words "of one parameter") is finding the smallest and the largest values in the data series. In the Voltage array the smallest value is Min(Voltage) = 0.04 V and the largest is Max(Voltage) = 9.05 V.

There are 55 batteries measured. One can conclude that it is highly unlikely to find a household battery with a voltage above 10 V and quite possible to find one that is (almost) completely discharged.
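As an illustration, the first step can be reproduced in a few lines of Python. This is a minimal sketch using only built-in functions; the list name `voltage` and the other variable names are my own choices, not part of the original spreadsheet workflow.

```python
# Battery voltages in volts, copied from the Voltage array in the article.
voltage = [
    0.096, 0.731, 0.442, 1.328, 1.327, 0.67, 0.58, 1.354, 1.358, 0.093,
    1.519, 1.586, 1.598, 1.599, 1.544, 1.596, 1.595, 1.518, 1.584, 1.58,
    1.584, 1.584, 1.581, 1.575, 1.581, 9.05, 6.06, 7.66, 9.01, 0.16,
    0.04, 1.574, 1.576, 1.55, 1.549, 1.403, 1.406, 1.564, 1.338, 1.525,
    1.048, 1.143, 1.439, 1.442, 1.483, 1.483, 3.73, 3.74, 4.16, 1.53,
    1.53, 1.53, 1.53, 1.563, 1.563,
]

n = len(voltage)      # number of measurements
v_min = min(voltage)  # smallest measured voltage
v_max = max(voltage)  # largest measured voltage

print(f"n = {n}, min = {v_min} V, max = {v_max} V")
```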

The next question could be: what is the most probable value for the voltage of one battery? As often happens with sentences in everyday language, the meaning of the words depends on context. The term "most probable value" can be defined in statistics in different ways, each of them "correct". The most probable value also goes by the name central tendency.

Mean (arithmetic mean, average).

The mean is calculated by the formula:

X = (1⁄n)× Σ x_i

(1),

where x_i is one of the n values in the set of measurements and Σ x_i is the sum of all measurements. Although it looks counter-intuitive, the mean is the point that minimizes the sum of squared (not absolute) deviations from all the measured points. Formula (1) for the mean is easily obtained by setting the derivative of that sum to zero, equation (2):

d[Σ (x_i - X )² ]/dX = 0

(2),
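For readers who want the intermediate steps, the derivation can be written out as follows. It uses the same symbols as formulas (1) and (2), with the summation index i running from 1 to n.

```latex
\frac{d}{dX}\sum_{i=1}^{n}(x_i - X)^2
  = -2\sum_{i=1}^{n}(x_i - X)
  = -2\left(\sum_{i=1}^{n} x_i - nX\right) = 0
\quad\Longrightarrow\quad
X = \frac{1}{n}\sum_{i=1}^{n} x_i
```

The second derivative equals 2n, which is positive, so this point is a minimum of the sum of squared deviations.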

For our battery example the mean is V = 1.9269455 V. Considering that the mean gives more weight to large deviations, it is a reasonably fair prediction for the voltage of a battery taken at random from the set. Nevertheless, the mean overestimates far more often than it underestimates: 48 of the 55 batteries lie below it and only 7 above, a ratio of 48/7.

The main advantage of the mean as a measure of the most probable value is the simplicity of its calculation; for symmetric distributions of measured values it also gives an equal number of overestimations and underestimations.
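The mean and the 48/7 imbalance can be checked with a short Python sketch. It assumes the same 55 readings as the Voltage array; the names `below` and `above` are illustrative.

```python
# Battery voltages in volts, copied from the Voltage array in the article.
voltage = [
    0.096, 0.731, 0.442, 1.328, 1.327, 0.67, 0.58, 1.354, 1.358, 0.093,
    1.519, 1.586, 1.598, 1.599, 1.544, 1.596, 1.595, 1.518, 1.584, 1.58,
    1.584, 1.584, 1.581, 1.575, 1.581, 9.05, 6.06, 7.66, 9.01, 0.16,
    0.04, 1.574, 1.576, 1.55, 1.549, 1.403, 1.406, 1.564, 1.338, 1.525,
    1.048, 1.143, 1.439, 1.442, 1.483, 1.483, 3.73, 3.74, 4.16, 1.53,
    1.53, 1.53, 1.53, 1.563, 1.563,
]

# Formula (1): arithmetic mean.
mean = sum(voltage) / len(voltage)

# Count how many batteries the mean overestimates (reading below the mean)
# and how many it underestimates (reading above the mean).
below = sum(1 for v in voltage if v < mean)
above = sum(1 for v in voltage if v > mean)

print(f"mean = {mean:.7f} V, below = {below}, above = {above}")
```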

Median.

If one thinks that the most important characteristic of the most probable point is an equal number of overestimations and underestimations, the median is defined to give exactly that. For an odd number of measurements the median is the middle value of the sorted set, with equal numbers of points above and below it. For an even number of measurements it is the average of the two middle points, so that again equal numbers lie above and below the median.

In our example the median battery voltage is 1.53 V. It is practically the exact nominal value for the new AA and AAA batteries in the set, but it removes from the picture even a hint of the presence of the 9 V batteries.

The median is more time-consuming to calculate than the mean, because the data typically have to be sorted first.
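The odd and even cases described above can be sketched in Python as follows. The function `median` is a hypothetical helper written for this illustration; the standard library offers `statistics.median`, which is used here only as a cross-check.

```python
import statistics

# Battery voltages in volts, copied from the Voltage array in the article.
voltage = [
    0.096, 0.731, 0.442, 1.328, 1.327, 0.67, 0.58, 1.354, 1.358, 0.093,
    1.519, 1.586, 1.598, 1.599, 1.544, 1.596, 1.595, 1.518, 1.584, 1.58,
    1.584, 1.584, 1.581, 1.575, 1.581, 9.05, 6.06, 7.66, 9.01, 0.16,
    0.04, 1.574, 1.576, 1.55, 1.549, 1.403, 1.406, 1.564, 1.338, 1.525,
    1.048, 1.143, 1.439, 1.442, 1.483, 1.483, 3.73, 3.74, 4.16, 1.53,
    1.53, 1.53, 1.53, 1.563, 1.563,
]


def median(values):
    """Return the median: the middle value of the sorted data."""
    ordered = sorted(values)  # sorting is the expensive step
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: one middle point with equal numbers on both sides.
        return ordered[mid]
    # Even count: average of the two middle points.
    return (ordered[mid - 1] + ordered[mid]) / 2


battery_median = median(voltage)
print(f"median = {battery_median} V")
print("matches statistics.median:", battery_median == statistics.median(voltage))
```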

Central tendency.

Besides the mean and the median there are many other values that define the most probable position in the distribution of a data set, also called a central point.

The solution of equation (2), modified by replacing the squared deviation with its absolute value and the mean X with a central point X:

d[Σ |x_i - X |]/dX = 0

(3),

gives the central point as the minimum of the sum of absolute deviations. For our battery set this value is 1.53 V, the same as the median. That is no coincidence: the sum of absolute deviations is always minimized at the median. Solving (3) directly is more laborious than finding the median, and it adds little to the analysis anyway.
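The minimum in equation (3) can be checked by brute force. The sketch below relies on the fact that a sum of absolute deviations is piecewise linear in X, so its minimum is reached at one of the data points; it therefore only tries the measured values as candidates. The helper name `sum_abs_dev` is my own.

```python
import statistics

# Battery voltages in volts, copied from the Voltage array in the article.
voltage = [
    0.096, 0.731, 0.442, 1.328, 1.327, 0.67, 0.58, 1.354, 1.358, 0.093,
    1.519, 1.586, 1.598, 1.599, 1.544, 1.596, 1.595, 1.518, 1.584, 1.58,
    1.584, 1.584, 1.581, 1.575, 1.581, 9.05, 6.06, 7.66, 9.01, 0.16,
    0.04, 1.574, 1.576, 1.55, 1.549, 1.403, 1.406, 1.564, 1.338, 1.525,
    1.048, 1.143, 1.439, 1.442, 1.483, 1.483, 3.73, 3.74, 4.16, 1.53,
    1.53, 1.53, 1.53, 1.563, 1.563,
]


def sum_abs_dev(values, center):
    """Sum of absolute deviations of all values from a candidate center."""
    return sum(abs(v - center) for v in values)


# Try every measured value as a candidate central point and keep the best.
central_point = min(voltage, key=lambda c: sum_abs_dev(voltage, c))

print(f"central point = {central_point} V")
print(f"median        = {statistics.median(voltage)} V")
```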

Additional complications arise when there is more than one central point in a data set. In our example we know that there are three distinct nominal battery voltages, 1.5, 3.7 and 9 V, which presumes the presence of at least three distinct central points (mixing apples with oranges). Another example of more than one central point is the phenomenon of electron diffraction, where the probability of detecting a particle has several distinct maxima. This situation will be discussed in more detail later.

The next question to answer is: how big could our mistake be if we take the central point as the best estimate?

Standard deviation.

σ = {(1/n)×Σ (x_i - X )²}^0.5 = (1/n)×[(n× Σx_i² - (Σx_i)² )]^0.5 = [(1/n)× Σx_i² - X² ]^0.5

(4),

Formulas (4) provide a fast way to calculate the standard deviation, which characterizes the typical deviation from the mean in a set of measurements.
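The three equivalent forms in (4) can be compared numerically. The sketch below is an illustration under the assumption that the population form (factor 1/n) is meant; `statistics.pstdev` from the Python standard library serves as an independent cross-check, and the variable names are mine.

```python
import math
import statistics

# Battery voltages in volts, copied from the Voltage array in the article.
voltage = [
    0.096, 0.731, 0.442, 1.328, 1.327, 0.67, 0.58, 1.354, 1.358, 0.093,
    1.519, 1.586, 1.598, 1.599, 1.544, 1.596, 1.595, 1.518, 1.584, 1.58,
    1.584, 1.584, 1.581, 1.575, 1.581, 9.05, 6.06, 7.66, 9.01, 0.16,
    0.04, 1.574, 1.576, 1.55, 1.549, 1.403, 1.406, 1.564, 1.338, 1.525,
    1.048, 1.143, 1.439, 1.442, 1.483, 1.483, 3.73, 3.74, 4.16, 1.53,
    1.53, 1.53, 1.53, 1.563, 1.563,
]

n = len(voltage)
sum_x = sum(voltage)                  # Σ x_i
sum_x2 = sum(v * v for v in voltage)  # Σ x_i²
mean = sum_x / n                      # X

# First form: square root of the mean squared deviation from X.
sigma_def = math.sqrt(sum((v - mean) ** 2 for v in voltage) / n)

# Second form: uses only the running sums Σ x_i and Σ x_i².
sigma_sums = math.sqrt(n * sum_x2 - sum_x ** 2) / n

# Third form: mean of squares minus square of the mean.
sigma_mean_sq = math.sqrt(sum_x2 / n - mean ** 2)

print(sigma_def, sigma_sums, sigma_mean_sq, statistics.pstdev(voltage))
```

The second and third forms need only one pass over the data, which is why they are fast. They subtract two large, nearly equal numbers, though, so for data whose mean is much larger than its spread the first form is the numerically safer choice.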

There is a difference between the sample standard deviation and the population standard deviation: in formulas (4) they use the factor 1/(n-1) or 1/n respectively. The idea is that when one has measured a sample subset of a whole population of objects and wants to draw a conclusion about the standard deviation of the whole population, including the objects not measured, the deviation computed from the sample needs to be corrected. The factor n/(n-1) applied to the variance is known as Bessel's correction: deviations are measured from the sample mean, which is itself fitted to the same data, so the uncorrected sum of squares underestimates the population variance on average by the factor (n-1)/n. The correction presumes that the sample subset is truly random, and a numerical experiment is a good way to see it at work. In practice the difference between the sample and population standard deviations is small unless n is small. To avoid overcomplication, here and later the standard deviation we use is the population standard deviation.
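A numerical experiment of the kind mentioned above can be sketched in Python. Everything in it is an assumption made for illustration: the population is taken to be normal with σ = 2 (variance 4), the sample size is 5, and 20000 random samples are drawn with a fixed seed. For each sample the variance is computed with both the 1/n and the 1/(n-1) factor, and the two estimates are averaged over all samples.

```python
import random

random.seed(0)        # fixed seed so the experiment is repeatable

n = 5                 # size of each random sample (assumed)
trials = 20000        # number of samples drawn (assumed)
true_sigma = 2.0      # population standard deviation (assumed)
true_variance = true_sigma ** 2

total_pop = 0.0       # running sum of variances with factor 1/n
total_sample = 0.0    # running sum of variances with factor 1/(n-1)

for _ in range(trials):
    sample = [random.gauss(0.0, true_sigma) for _ in range(n)]
    m = sum(sample) / n
    squares = sum((x - m) ** 2 for x in sample)
    total_pop += squares / n
    total_sample += squares / (n - 1)

avg_pop = total_pop / trials
avg_sample = total_sample / trials

print(f"true variance           = {true_variance}")
print(f"average with 1/n        = {avg_pop:.3f}")
print(f"average with 1/(n-1)    = {avg_sample:.3f}")
print(f"(n-1)/n * true variance = {(n - 1) / n * true_variance:.3f}")
```

If the reasoning behind the correction holds, the 1/n average should settle near (n-1)/n of the true variance and the 1/(n-1) average near the true variance itself.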

In our battery example σ = 1.86312 V. What information does it give us? It gives the average size of the mistake. That is not entirely true: it gives an average in which large mistakes have more weight (because it is defined through the sum of squared mistakes). Is that truly important? Maybe sometimes, but not for a very basic data analysis.

There is a commonly accepted three-sigma rule. In plain words, the rule states that mistakes larger than three sigma from the average value are so rare that measurements outside the interval X ± 3×σ can be safely neglected and thrown out of the set. In our example the three-sigma interval is from -3.66241 V to 7.51631 V, and three of the 55 measurements lie outside it, which is a significant share (these are batteries with a 9 V nominal voltage). The reason is that the three-sigma rule works well enough only for truly stochastic measurements of one unique characteristic of objects of the same nature. This means that some characteristic with an exact value is measured repeatedly and, owing to numerous random outside influences, each new measurement produces a number that deviates slightly from the true value (if a "true" value exists and has physical sense; a single number cannot always be an entirely "true" characteristic of a parameter). The probability distribution of such truly stochastic measurements should follow the normal (Gaussian) probability density. That is a topic for more advanced analysis than is discussed here.
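The three-sigma interval and the readings that fall outside it can be found with a short Python sketch. It assumes the population standard deviation (`statistics.pstdev`), in line with the choice made in this article; the names `low`, `high` and `outside` are illustrative.

```python
import statistics

# Battery voltages in volts, copied from the Voltage array in the article.
voltage = [
    0.096, 0.731, 0.442, 1.328, 1.327, 0.67, 0.58, 1.354, 1.358, 0.093,
    1.519, 1.586, 1.598, 1.599, 1.544, 1.596, 1.595, 1.518, 1.584, 1.58,
    1.584, 1.584, 1.581, 1.575, 1.581, 9.05, 6.06, 7.66, 9.01, 0.16,
    0.04, 1.574, 1.576, 1.55, 1.549, 1.403, 1.406, 1.564, 1.338, 1.525,
    1.048, 1.143, 1.439, 1.442, 1.483, 1.483, 3.73, 3.74, 4.16, 1.53,
    1.53, 1.53, 1.53, 1.563, 1.563,
]

mean = sum(voltage) / len(voltage)
sigma = statistics.pstdev(voltage)  # population standard deviation

# Bounds of the interval X ± 3σ.
low = mean - 3 * sigma
high = mean + 3 * sigma

# Measurements that the three-sigma rule would discard.
outside = [v for v in voltage if v < low or v > high]

print(f"interval: {low:.5f} V to {high:.5f} V")
print(f"outside:  {outside}")
```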

Coefficient of variation.

μ = σ/| X|

(5),

Formula (5) produces the coefficient of variation, which is always a positive number. It shows how accurate the value of the mean is. In practice the coefficient of variation determines how many digits of the measured value are actually meaningful to present:

N_digits = round(- log₁₀(μ) +2)

(6),

In formula (6) the added 2 digits are a common-sense allowance, so as not to throw everything into the trash bin when the accuracy is poor. In our case, with mean = 1.9269455 and σ = 1.86312, the coefficient of variation is μ = 0.966877371 and the number of digits is 2. Strictly speaking the second digit carries no real significance here, but declaring the mean to be 2 V could be confusing, since it implies that the value is an integer by nature. So the best presentation of our experiment is the sentence:

the voltage of the batteries found in my home is 1.9 ± 1.9 V

This is the best that basic data analysis can do, but one has to understand that more advanced data analysis can deliver more.
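Formulas (5) and (6) translate into Python as sketched below. Interpreting N_digits as a count of significant digits, and formatting with the `g` format specifier, are my assumptions about how the result is meant to be written down.

```python
import math
import statistics

# Battery voltages in volts, copied from the Voltage array in the article.
voltage = [
    0.096, 0.731, 0.442, 1.328, 1.327, 0.67, 0.58, 1.354, 1.358, 0.093,
    1.519, 1.586, 1.598, 1.599, 1.544, 1.596, 1.595, 1.518, 1.584, 1.58,
    1.584, 1.584, 1.581, 1.575, 1.581, 9.05, 6.06, 7.66, 9.01, 0.16,
    0.04, 1.574, 1.576, 1.55, 1.549, 1.403, 1.406, 1.564, 1.338, 1.525,
    1.048, 1.143, 1.439, 1.442, 1.483, 1.483, 3.73, 3.74, 4.16, 1.53,
    1.53, 1.53, 1.53, 1.563, 1.563,
]

mean = sum(voltage) / len(voltage)
sigma = statistics.pstdev(voltage)  # population standard deviation

# Formula (5): coefficient of variation (requires a non-zero mean).
cv = sigma / abs(mean)

# Formula (6): number of digits worth presenting (requires cv > 0).
n_digits = round(-math.log10(cv) + 2)

# Assumption: treat n_digits as significant digits when formatting.
mean_text = f"{mean:.{n_digits}g}"
sigma_text = f"{sigma:.{n_digits}g}"

print(f"cv = {cv:.6f}, digits = {n_digits}")
print(f"{mean_text} ± {sigma_text} V")
```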

Summary.

The most basic data analysis of a series of measurements of one characteristic consists of the following steps:

calculate the minimum, maximum, mean (arithmetic average), standard deviation and coefficient of variation of the data set;
if the coefficient of variation is less than 0.03, present the data as mean ± standard deviation, written with the number of digits given by formula (6);
otherwise, if the coefficient of variation is larger than 0.03, do a deeper analysis;
if the absolute deviation of the minimum or maximum value from the mean is larger than 4×σ, do additional analysis too.

The threshold value of 0.03 for the coefficient of variation is a somewhat arbitrary, common-sense nostalgia for what a good experiment in old-school hard science should produce: a number with an acceptable level of accuracy.
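The steps above can be gathered into one function. This is only a sketch of the procedure as I read it: the function name `basic_analysis`, the dictionary keys and the parameter names are hypothetical, a coefficient of variation exactly equal to the threshold is treated as needing deeper analysis (the text leaves that case open), and the data are assumed to have a non-zero mean and a non-zero spread.

```python
import math
import statistics


def basic_analysis(values, cv_threshold=0.03, extreme_sigmas=4.0):
    """Apply the summary steps to one series of measurements."""
    v_min, v_max = min(values), max(values)
    mean = sum(values) / len(values)
    sigma = statistics.pstdev(values)      # population standard deviation
    cv = sigma / abs(mean)                 # formula (5)
    n_digits = round(-math.log10(cv) + 2)  # formula (6)

    # Largest distance of the minimum or maximum from the mean.
    max_dev = max(abs(v_min - mean), abs(v_max - mean))

    return {
        "min": v_min,
        "max": v_max,
        "mean": mean,
        "sigma": sigma,
        "cv": cv,
        "digits": n_digits,
        # Assumption: cv equal to the threshold also triggers deeper analysis.
        "deeper_analysis": not (cv < cv_threshold),
        "extreme_values": max_dev > extreme_sigmas * sigma,
    }


# Battery voltages in volts, copied from the Voltage array in the article.
voltage = [
    0.096, 0.731, 0.442, 1.328, 1.327, 0.67, 0.58, 1.354, 1.358, 0.093,
    1.519, 1.586, 1.598, 1.599, 1.544, 1.596, 1.595, 1.518, 1.584, 1.58,
    1.584, 1.584, 1.581, 1.575, 1.581, 9.05, 6.06, 7.66, 9.01, 0.16,
    0.04, 1.574, 1.576, 1.55, 1.549, 1.403, 1.406, 1.564, 1.338, 1.525,
    1.048, 1.143, 1.439, 1.442, 1.483, 1.483, 3.73, 3.74, 4.16, 1.53,
    1.53, 1.53, 1.53, 1.563, 1.563,
]

result = basic_analysis(voltage)
for key, value in result.items():
    print(f"{key}: {value}")
```

For the battery data the coefficient of variation is far above 0.03, so the procedure asks for deeper analysis, which agrees with the earlier observation that the set mixes batteries of several nominal voltages.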

Most important: know what your numbers correspond to in the material world. Numbers by themselves can be misleading, and very often they are.

A calculator for the parameters described above is available at: calculator of basic statistics of one variable.

Oct. 12, 2017; 11:44 EST
